Abstract

C remains central to our computing infrastructure. It is no- 
tionally defined by ISO standards, but in reality the proper- 
ties of C assumed by systems code and those implemented 
by compilers have diverged, both from the ISO standards and 
from each other, and none of these are clearly understood. 

We make two contributions to help improve this error- 
prone situation. First, we describe an in-depth analysis of 
the design space for the semantics of pointers and memory 
in C as it is used in practice. We articulate many specific 
questions, build a suite of semantic test cases, gather experi- 
mental data from multiple implementations, and survey what 
C experts believe about the de facto standards. We identify 
questions where there is a consensus (either following ISO 
or differing) and where there are conflicts. We apply all this 
to an experimental C implemented above capability hard- 
ware. Second, we describe a formal model, Cerberus, for 
large parts of C. Cerberus is parameterised on its memory 
model; it is linkable either with a candidate de facto memory 
object model, under construction, or with an operational C11
concurrency model; it is defined by elaboration to a much 
simpler Core language for accessibility, and it is executable 
as a test oracle on small examples. 

This should provide a solid basis for discussion of what 
mainstream C is now: what programmers and analysis tools 
can assume and what compilers aim to implement. Ulti- 
mately we hope it will be a step towards clear, consistent, 
and accepted semantics for the various use-cases of C. 

Categories and Subject Descriptors F.3.2 [Semantics of 
Programming Languages ] 

Keywords C 


1. Introduction 

C, originally developed 40+ years ago, remains one of the 
central abstractions of our computing infrastructure, widely 
used for systems programming. It is notionally defined by 
ANSI/ISO standards, C89/C90, C99 and C11, and a number
of research projects have worked to formalise aspects of 
these [3, 16, 18, 25-29, 38-40]. But the question of what C
is in reality is much more complex, and more problematic, 
than the existence of an established standard would suggest; 
formalisation alone is not enough. 

Problem 1: the de facto standards vs the ISO standard 

In practice, we have to consider the behaviours of the vari- 
ous mainstream C compilers, the assumptions that systems 
programmers make about the behaviour they can rely on, the 
assumptions implicit in the corpus of existing C code, and 
the assumptions implicit in C analysis tools. Each of these de 
facto standards is itself unclear and they all differ, from each 
other and from the ISO standard, despite the latter’s stated 
intent to “codify the common, existing definition of C”. 

This is not just a theoretical concern. Over time, C im- 
plementations have tended to become more aggressively op- 
timising, exploiting situations which the standards now re- 
gard as undefined behaviour. This can break code that used 
to work correctly in practice, sometimes with security im- 
plications [47, 48]. As we shall see, critical infrastructure 
code, including the Linux and FreeBSD kernels, depends 
on idioms that current compilers do not in general sup- 
port, or that the ISO standard does not permit. This leads 
to tensions between the compiler and OS-developer commu- 
nities, with conflicting opinions about what C implementa- 
tions do or should guarantee. One often sees large projects 
built using compiler flags, such as -fno-strict-aliasing 
and -fno-strict-overflow, that turn off particular analy- 
ses; the ISO standard does not describe these. 

Meanwhile, developers of static and dynamic analysis 
tools must make ad hoc semantic choices to avoid too many 
“false” positives that are contrary to their customers’ expec- 
tations, even where these are real violations of the ISO stan- 
dards [5]. This knowledge of the de facto standards is often 
left implicit in the tool implementations, and making it pre- 
cise requires an articulation of the possible choices. 

At the same time, in other respects the space of mainstream
C implementations has become simpler than the one
that the C standard was originally written to cope with. For
example, mainstream hardware can now reasonably be as- 
sumed to have 8-bit bytes, twos-complement arithmetic, and 
(often) non-segmented memory, but the ISO standard does 
not take any of this into account. 

Problem 2: no precise and accessible specification Fo- 
cussing on the ISO standard alone, a second real difficulty is 
that it is hard to interpret precisely. C tries to solve a chal- 
lenging problem: to simultaneously enable programmers to 
write high-performance hand-crafted low-level code, pro- 
vide portability between widely varying target machine ar- 
chitectures, and support sophisticated compiler optimisa- 
tions. It does so by providing operations both on abstract val- 
ues and on their underlying concrete representations. The in- 
teraction between these is delicate, as are the dynamic prop- 
erties relating to memory and type safety, aliasing, concur- 
rency, and so on; they have, unsurprisingly, proved difficult 
to characterise precisely in the prose specification style of 
the standards. Even those few who are very familiar with 
the standard often struggle with the subtleties of C, as wit- 
nessed by the committee’s internal mailing list discussion 
of what the correct interpretation of the standard should be, 
the number of requests for clarification made in the form of 
defect reports [51], and inconclusive discussions of whether 
compiler anomalies are bugs w.r.t. the standard. The prose 
standard is not executable as a test oracle, which would let 
one simply compute the set of all allowed behaviours of any 
small test case, so such discussions have to rely on exegesis 
of the standard text. The goal of ANSI C89, to “develop a 
clear, consistent, and unambiguous Standard for the C pro- 
gramming language” [2], thus remains an open problem. 

The obvious semanticist response is to attempt a mathe- 
matical reformulation of the standard, as in the above-cited 
works, but this ignores the realities of the de facto standards. 
Moreover, a language standard, to serve as the contract be- 
tween implementors and users, must be accessible to both. 
Mathematical definitions of semantics are precise, especially 
if mechanised, but (especially mechanised) are typically not 
accessible to practitioners or the C standards community. 

Problem 3: integrating C11 concurrent and sequential
semantics Finally, there is the technical challenge of dealing
with concurrency. The C standard finally addressed concurrency
in C11, following C++ [1, 4, 8], and Batty et al. [3]
worked to ensure that here, unusually, the standard text is 
written in close correspondence to a formalisation. But that 
concurrency model is in an axiomatic style, while the rest 
of the standard is more naturally expressed in an operational 
semantics; it is hard to integrate the two. 

To summarise, the current state of C is, simply put, a 
mess. The divergence among the de facto and ISO standards, 
the prose form, ambiguities, and complexities of the latter, 
and the lack of an integrated treatment of concurrency, all 
mean that the ISO standard is not at present providing a
satisfactory definition of C as it is or should be. It also does
not provide a good basis for designing future refinements 
of C. The scale of the problem, and the real disagreements 
about what C is, mean that there is no simple solution to all 
this, but we take several steps to clarify the situation. 

Contributions Our first contribution is a detailed investi- 
gation into the possible semantics for pointers and memory 
in C (its memory object model), where there seems to be the 
most divergence and variation among the multiple de facto 
and ISO standards. We explore this in four interlinked ways: 

1. An in-depth analysis of the design space. We identify 85
questions, supported by 196 hand-written semantic test 
cases, and for each discuss the desired behaviour w.r.t. the 
ISO standard and real-world implementation and usage. 

2. A survey investigating the de facto standards, directly 
probing what systems programmers and compiler writers 
believe about compiler behaviour and extant code. 

3. Experimental data for our test suite, for GCC and Clang 
(for multiple versions and flags), the Clang undefined 
behavior, memory, and address sanitisers, TrustInSoft’s
tis-interpreter [44], and KCC [16, 18, 20].

4. In ongoing work, we are building a candidate formal 
model capturing one plausible view of the de facto stan- 
dards, as a reference for discussion. 

These feed into each other: the survey responses and experi- 
mental data inform our formal modelling, and the design of 
the latter, which eventually must provide a single coherent 
semantics that takes a position on the allowed semantics of 
arbitrary code, has raised many of the questions. 

Our focus here is on current mainstream C, not on the C 
of obsolete or hypothetical implementations. We aim to es- 
tablish a solid basis for future discussion of the de facto stan- 
dards, identifying places where there is a more-or-less clear 
consensus (either following ISO or differing) and places 
where there are clear conflicts. 

This has already proved useful in practice: we applied our 
analysis and test suite to the C dialect supported by Watson
et al.’s experimental CHERI processor [49, 50], which
implements unforgeable, bounds-checked C pointers (capa- 
bilities). CHERI C’s strong dynamically enforced memory 
safety is expected to be more restrictive than mainstream C, 
but also to support existing systems software with only mod- 
est adaptation. We identified several hardware and software 
bugs and important design questions for CHERI C, building 
on and informing our analysis of the de facto standards. 

For the memory object model, the de facto standard usage 
and behaviour are especially problematic, but many of the 
other subtle aspects of C have not been significantly affected 
by the introduction of abstract notions (of pointer, unspeci- 
fied value, etc.); for these, while there are many ambiguities 
in the ISO standard text and divergences between it and prac- 
tice [5, 43, 47], the text is a reasonable starting point. 

Our second contribution is a formal semantics, Cerberus, 
for large parts of C, in which we aim to capture the ISO
text for these aspects as clearly as possible. Cerberus is
parameterised on the memory model, so it can be instantiated
with the candidate formal model that we are developing or, 
in future, other alternative memory object models. It is ex- 
ecutable as a test oracle, to explore all behaviours or single 
paths of test programs; this is what lets us run our test suite 
against the model. Cerberus can also be instantiated with an 
operational C/C++11 concurrency model [37], though not
yet combined with a full memory object model. 

Cerberus aims to cover essentially all the material of Sections
5 Environment and 6 Language of the ISO C11 standard,
both syntax and semantics, except: preprocessor features,
C11 character-set features, floating-point and complex
types (beyond simple float constants), user-defined variadic
functions (we do cover printf), bitfields, volatile,
restrict, generic selection, register, flexible array members,
some exotic initialisation forms, signals, longjmp, multiple
translation units, and aspects where our candidate de
facto memory object model intentionally differs from the 
standard. It supports only small parts of the standard li- 
braries. Threads, atomic types, and atomic operations are 
supported only with a more restricted memory object model. 
For everything in scope, we aim to capture all the permitted 
semantic looseness, not just that of one or some implemen- 
tation choices. 

To manage some of the complexity, the Cerberus compu- 
tation semantics is expressed by an elaboration (a composi- 
tional translation) from a fully type-annotated C AST to a 
carefully defined Core: a typed call-by-value calculus with 
constructs to model certain aspects of the C dynamic seman- 
tics. Together these handle subtleties such as C evaluation or- 
der, integer promotions, and the associated implementation- 
defined and undefined behaviour, for which the ISO stan- 
dard is (largely) clear and uncontroversial. The elaboration 
closely follows the ISO standard text, allowing the two to be 
clearly related for accessibility. 

The Cerberus front-end comprises a clean-slate C parser 
(closely following the grammar of the standard), desugaring 
phase, and type checker. This required substantial engineer- 
ing, but it lets us avoid building in semantic choices about 
C that are implicit in the transformations from C source to 
AST done by compiler or CIL [35] front-ends. 

We also report on a preliminary experiment in translation 
validation (in Coq) for the front-end of Clang, for very sim- 
ple programs. 

All this is a considerable body of material: our design- 
space analysis alone is an 80+ page document; we refer 
to the extensive supplementary material [32] for that and 
for our test suite, survey results, and experimental data. We 
summarise selected design-space analysis in §2, apply it in 
§3 and §4, and summarise aspects of Cerberus in §5. We 
discuss validation and the current limitations of the model in 
§6, and related work in §7. 


2. Pointer and Memory Disagreements 

“Why do you have to ask questions that make me want to 
audit all my C code?” 

The most important de facto standards for C are those im- 
plicit in the billions of lines of extant code: the properties of 
the language implementations that all that code relies on to 
work correctly (to the extent that it does, of course). On the 
other side, we have the emergent behaviour that mainstream 
C compilers can exhibit, with their hundreds of analysis and 
optimisation passes. Both are hard to investigate directly, but 
usefully modelling C depends on the ability to define these 
real-world variants of C. Our surveys are, to the best of our 
knowledge, a novel approach to investigating the de facto se- 
mantics of a widely used language. We produced two. The 
first version, in early 2013, had 42 questions, with concrete 
code examples and subquestions about the de facto and ISO 
standards. We targeted this at a small number of experts, in- 
cluding multiple contributors to the ISO C or C++ standards 
committees, C analysis tool developers, experts in C formal 
semantics, compiler writers, and systems programmers. The 
results were very instructive, but this survey demanded a lot 
from the respondents; it was best done by discussing the 
questions with them in person over several hours. 

Our second version, in early 2015, was simplified, making
it feasible to collect responses from a wider community.
We designed 15 questions, selecting some of the most inter- 
esting issues from our earlier survey, asked only about the de 
facto standard (typically asking whether some idiom would 
work in normal C compilers and whether it was used in 
practice), omitted the concrete code examples, and polished 
the questions to prevent misunderstandings that we saw in 
early trials. We refer to these questions as [n/15] below. 
Aiming for a modest-scale but technically expert audience, 
we distributed the survey among our local systems research 
group, at EuroLLVM 2015, via technical mailing lists: gcc, 
llvmdev, cfe-dev, libc-alpha, xorg, freebsd-developers, xen- 
devel, and Google C user and compiler lists, and via John 
Regehr’s blog, widely read by C experts. There were 323 re- 
sponses, including around 100 printed pages of textual com- 
ments (the above quote among them). Most respondents re- 
ported expertise in C systems programming and many re- 
ported expertise in compiler internals and in the C standard: 


C applications programming              255
C systems programming                   230
Linux developer                         160
Other OS developer                      111
C embedded systems programming          135
C standard                               70
C or C++ standards committee member       8
Compiler internals                       64
GCC developer                            15
Clang developer                          26
Other C compiler developer               22
Program analysis tools                   44
Formal semantics                         18
no response                               6
other                                    18




We also used quantitative data on the occurrence of particu- 
lar idioms in systems C code from CHERI [11]. 

Our full set of 85 questions [10] addresses all the C 
memory object model semantic issues that we are currently 
aware of, including all those in these two surveys. We refer 
to these as Qnn below; they can be categorised as follows 
(with the number of questions in each category): 


Pointer provenance basics                                  3
Pointer provenance via integer types                       5
Pointers involving multiple provenances                    5
Pointer provenance via pointer representation copying      4
Pointer provenance and union type punning                  2
Pointer provenance via IO                                  1
Stability of pointer values                                1
Pointer equality comparison (with == or !=)                3
Pointer relational comparison (with <, >, <=, or >=)       3
Null pointers                                              3
Pointer arithmetic                                         6
Casts between pointer types                                2
Accesses to related structure and union types              4
Pointer lifetime end                                       2
Invalid accesses                                           2
Trap representations                                       2
Unspecified values                                        11
Structure and union padding                               13
Basic effective types                                      2
Effective types and character arrays                       1
Effective types and subobjects                             6
Other questions                                            5


In this section we discuss some of the key points of 
disagreement between practitioners, implementers and the 
standard with regard to pointers and memory, referring to 
our survey and experimental results. Of the 85 questions, 

• for 39 the ISO standard is unclear; 

• for 27 the de facto standards are unclear, in some cases 
with significant differences between usage and imple- 
mentation; and 

• for 27 there are significant differences between the ISO 
and the de facto standards. 

2.1 Pointer Provenance 

Originally one could think of C as manipulating “the same 
sort of objects that most computers do, namely characters, 
numbers, and addresses”, Kernighan and Ritchie [24, p.2].
At runtime, for conventional C implementations, that is still 
basically true, but the current ISO standards involve more 
abstract values, for pointers, unspecified values, and typed 
regions of memory; they cannot be considered as simple bit- 
vector-represented quantities. Compile-time analyses rely on 
the more abstract notions to legitimise optimisation trans- 
forms, and exactly what they are leads to some of the most 
important and subtle questions about C. 

The ISO WG14 Defect Report DR260 Committee Re- 
sponse [53] declares that “The implementation is entitled 
to take account of the provenance of a pointer value when 
determining what actions are and are not defined.” This is
observable in practice for the following example from our 
test suite, adapted from DR260. 


EXAMPLE (provenance_basic_global_yx.c):

#include <stdio.h>
#include <string.h>
int y=2, x=1;
int main() {
  int *p = &x + 1;
  int *q = &y;
  printf("Addresses: p=%p q=%p\n", (void*)p, (void*)q);
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11; // does this have undefined behaviour?
    printf("x=%d y=%d *p=%d *q=%d\n", x, y, *p, *q);
  }
  return 0;
}

If x and y happen to be allocated in adjacent memory,
&x+1 and &y will have bitwise-identical runtime representation
values, the memcmp will succeed, and p (derived from
a pointer to x) will have the same representation value as
the pointer to y (a different object) at the point of the
update *p=11. In a concrete semantics we would expect to
see x=1 y=11 *p=11 *q=11, but GCC produces x=1 y=2
*p=11 *q=2 (ICC produces x=1 y=2 *p=11 *q=11). This
suggests that GCC is reasoning, from provenance information,
that *p does not alias with y or *q, and hence that the
initial value of y=2 can be propagated to the final printf.
Note that this is not a type-based aliasing issue: the pointers
are of the same type, and the GCC result is not affected by
-fno-strict-aliasing.

DR260 suggests a semantics in which pointer values in- 
clude not just a concrete address but also provenance infor- 
mation, erased at runtime in conventional implementations, 
including a unique ID from the original allocation. That can 
be used in the semantics for memory accesses to check that 
the address used is consistent with the original allocation, 
which here lets the *p = 11 access be regarded as unde- 
fined behaviour. The existence of an execution with unde- 
fined behaviour means this program is considered erroneous 
and compilers are notionally entirely unconstrained in how 
they treat it — thus making the analysis and optimisation 
(vacuously) correct in this case. This general pattern is typi- 
cal for C: regarding situations as undefined behaviour puts an 
obligation on programmers to avoid them, but permits com- 
pilers to make strong assumptions when optimising. 
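
The provenance idea can be made concrete with a small executable model. The following is a minimal sketch only, assuming a pointer value is a pair of an allocation ID and a concrete address; the names allocation, abs_ptr, same_representation and access_is_defined are hypothetical and are not taken from DR260 or from any formal model discussed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types, introduced only for illustration. */
struct allocation { unsigned id; size_t base; size_t size; };
struct abs_ptr    { unsigned prov; size_t addr; }; /* provenance ID + concrete address */

/* What memcmp on the pointer bytes would observe: only the address. */
static bool same_representation(struct abs_ptr a, struct abs_ptr b)
{
    return a.addr == b.addr;
}

/* Access-time check: the accessed bytes must lie inside the allocation
   that the pointer's provenance names, whatever the concrete address is. */
static bool access_is_defined(struct abs_ptr p, const struct allocation *allocs,
                              size_t n_allocs, size_t access_size)
{
    assert(allocs != NULL || n_allocs == 0);
    for (size_t i = 0; i < n_allocs; i++) {
        if (allocs[i].id == p.prov) {
            return p.addr >= allocs[i].base
                && access_size <= allocs[i].size
                && p.addr - allocs[i].base <= allocs[i].size - access_size;
        }
    }
    return false; /* unknown provenance: treated as undefined in this sketch */
}
```

With x modelled as allocation 1 at address 100 and y as allocation 2 at address 104 (both of size 4), the pointers {1, 104} and {2, 104} have the same representation, but only the second passes the access check, mirroring the treatment of *p = 11 as undefined behaviour.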

So far this is uncontroversial among C experts, though it 
may be surprising to anyone at first sight, and several models 
for C or for particular C-like languages have had semantics 
for pointers with some kind of (block-ID, offset) model. But 
it leads to many more vexed questions, of which we give a 
sample below. 

Q25 Can one do relational comparison (with <, >, <=,
or >=) of two pointers to separately allocated objects
(of compatible object types)? ISO clearly prohibits
this [1, §6.5.8p5], but our surveys show that it is widely
used, e.g. for global lock orderings and for collection-
implementation orderings. Numerically (survey question
[7/15]), we get: Will that work in normal C compilers? yes:
191 (60%), only sometimes: 52 (16%), no: 31 (9%), don’t
know: 38 (12%), I don’t know what the question is asking:
3 (1%), and Do you know of real code that relies on it? yes:
101 (33%), yes, but it shouldn’t: 37 (12%), no, but there
might well be: 89 (29%), no, that would be crazy: 50 (16%),
don’t know: 27 (8%).

The only cases where it is widely thought to fail are for 
segmented architectures, but those are now quite uncommon 
(e.g. AS/400 and old x86 processors) except for some main- 
frame architectures. For mainstream C semantics, it seems 
more useful to permit it than to take the strict-ISO view that 
all occurrences are bugs. (One respondent encountered prob- 
lems for pointers spanning the middle of the address space). 
In our model, we can easily regard these relational compar- 
isons as ignoring the provenance information. 
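
The global lock ordering use can be sketched as follows. This is an illustrative sketch, not code from the survey responses: struct lock and order_by_address are hypothetical names. The de facto idiom compares the two pointers directly with <; the version here compares them after conversion to uintptr_t, which replaces the ISO undefined behaviour with an implementation-defined conversion and assumes a flat address space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lock object; a real one would wrap a mutex. */
struct lock { int held; };

/* Choose a canonical acquisition order for two separately allocated locks,
   so that every thread takes them in the same order and cannot deadlock.
   The de facto idiom writes `a < b` here; ISO 6.5.8p5 does not define that
   for separate objects, so this sketch compares the integer values. */
static void order_by_address(struct lock *a, struct lock *b,
                             struct lock **first, struct lock **second)
{
    assert(a != b);
    if ((uintptr_t)a < (uintptr_t)b) {
        *first = a;
        *second = b;
    } else {
        *first = b;
        *second = a;
    }
}
```

Whichever argument order the caller uses, the same lock comes out first, which is the property the idiom relies on.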

In real C this is one of several ways in which concrete 
address values are exposed to programs (along with explicit 
casts to integer types, IO of pointers, and examination of 
pointer representation bytes). Abstract pointer values must 
also therefore contain concrete addresses, in contrast to those 
earlier C semantics that only had block IDs, and the seman- 
tics must let those be nondeterministically chosen in any way 
that a reasonable implementation might. 

Q9 Can one make a usable offset between two separately 
allocated objects by inter-object integer or pointer sub- 
traction, making a usable pointer to the second by adding 
the offset to a pointer to the first? Here again we see 
a conflict, now with C usage on one side and compilers 
and ISO on the other. In practice, this usage is uncommon, 
but the basic pattern does occur in important specific cases, 
e.g. in the Linux and FreeBSD implementations of per-CPU 
variables. However, current compilers sometimes do opti- 
mise based on an assumption, in a points-to analysis, that 
inter-object pointer arithmetic does not occur. 

How could this be resolved? One could argue that the us- 
ages are bugs and should be rewritten, but that would need- 
lessly lose performance; it seems unlikely to be acceptable. 
One could globally turn off those optimisations, e.g. with 
-fno-tree-pta for GCC, but that too is a blunt instrument.
One could adapt the compiler analyses to treat inter-object 
pointer subtractions as giving integer offsets that have the 
power to move between objects. It would be easy to define 
a corresponding multiple-provenance semantics, with prove- 
nances that could be wildcards or sets of allocation IDs, not 
just the singleton provenances implicit in DR260, but imple- 
mentations would have to be conservative where they could 
not determine that pointer subtractions are intra-object; that 
also might be too costly. Or, finally, one could require such 
usages to be explicitly annotated, e.g. with an attribute on the 
resulting pointer that declares that it might alias with any- 
thing. None of these are wholly satisfactory. We suspect the 
last is the most achievable, but for the moment our candidate 
formal model forbids this idiom. 


As for many of our questions, there are essentially social 
or political questions as to whether the ISO standard, com- 
mon usage, and/or mainstream compiler behaviour can be 
changed. That is beyond the scope of this paper; the most 
we can do is make a start on clearly articulated possibilities. 

Q5 Must provenance information be tracked via casts to 
integer types and integer arithmetic? The survey shows 
that it is common to cast pointers to integer types and back 
to do arithmetic on them, e.g. to store information in un- 
used bits. To define a coherent model we must either regard 
the result of casting from an integer type to a pointer as of 
a wildcard provenance or track provenance through integer 
operations. The GCC documentation states “When casting 
from pointer to integer and back again, the resulting pointer 
must reference the same object as the original pointer, otherwise
the behavior is undefined.”, strongly suggesting the
latter, but raising the question of what to do if there is no sin- 
gle “original pointer”. Our formal model associates prove- 
nances with all integer values, following the same at-most- 
one provenance model we use for pointers. 
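
The unused-bits idiom mentioned above might look like the following sketch. It assumes that int objects are at least 4-byte aligned and that the pointer-to-integer mapping is the identity on addresses, as on mainstream implementations; TAG_MASK, tag_pointer, get_tag and strip_tag are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the two low address bits of an int object are always zero. */
#define TAG_MASK ((uintptr_t)0x3)

/* Store a small tag in the unused low bits of a pointer's integer value. */
static uintptr_t tag_pointer(int *p, unsigned tag)
{
    uintptr_t bits = (uintptr_t)p;
    assert((bits & TAG_MASK) == 0); /* alignment assumption holds */
    assert(tag <= TAG_MASK);
    return bits | tag;
}

/* Recover the tag. */
static unsigned get_tag(uintptr_t tagged)
{
    return (unsigned)(tagged & TAG_MASK);
}

/* Clear the tag and cast back; whether the result is usable depends on
   provenance surviving the integer arithmetic, which is the point of Q5. */
static int *strip_tag(uintptr_t tagged)
{
    return (int *)(tagged & ~TAG_MASK);
}
```

Under an at-most-one-provenance model, the bitwise operations combine a value carrying the provenance of the original object with provenance-free constants, so the stripped pointer keeps that provenance and the access is permitted.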

Q2 Can equality testing on pointers be affected by 
pointer provenance information? The ISO standard 
explicitly permits equality comparison between pointers 
to separately allocated objects (of compatible types) [1, 
§6.5.9], but leaves open whether the result of such com- 
parison might depend on provenance (presumably simply 
because the text has not been systematically revised since 
DR260). Experimentally, our test suite includes cases where 
GCC regards two pointers with the same runtime represen- 
tation but different provenances as unequal if the allocations 
and comparison are in the same compilation unit but equal 
if they are split across two compilation units, showing that 
the compiler does take advantage of the provenance information
if it is statically available. This is not a semantics that
we would a priori choose as language designers, but it can 
be soundly modelled by making a nondeterministic choice 
at each such comparison whether to take provenance into 
account or not. Writing a formal semantics usefully forces 
us to consider such questions; it also reveals the language- 
definition-complexity cost of such optimisations. 
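
One way to picture the nondeterministic choice is to have the model return the set of permitted results of a comparison rather than a single result. The sketch below is an assumption-laden illustration, not the candidate formal model: abs_ptr, eq_outcomes, ptr_eq_outcomes and outcome_allowed are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical abstract pointer: provenance ID plus concrete address. */
struct abs_ptr { unsigned prov; size_t addr; };

/* Set of results that p == q is allowed to produce, as a bitmask. */
enum eq_outcomes { EQ_FALSE_ONLY = 1, EQ_TRUE_ONLY = 2, EQ_EITHER = 3 };

static enum eq_outcomes ptr_eq_outcomes(struct abs_ptr a, struct abs_ptr b)
{
    if (a.addr != b.addr)
        return EQ_FALSE_ONLY; /* different addresses never compare equal */
    if (a.prov == b.prov)
        return EQ_TRUE_ONLY;  /* same address, same provenance */
    return EQ_EITHER;         /* same address, different provenance:
                                 the implementation may choose */
}

/* Is an observed result consistent with the model? */
static bool outcome_allowed(enum eq_outcomes set, bool observed)
{
    return (set & (observed ? EQ_TRUE_ONLY : EQ_FALSE_ONLY)) != 0;
}
```

A test oracle built this way accepts both the same-compilation-unit result (unequal) and the split-compilation-unit result (equal) for pointers that share an address but not a provenance.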

2.2 Out-of-Bounds Pointers 

The ISO standard permits only very limited pointer arith- 
metic, essentially within an array, among the members of a 
struct, to access representation bytes, or one-past an object. 
But in practice it seems to be common to transiently con- 
struct out-of-bounds pointers (Q31). Chisnall et al. found 
this in 7 of the 13 C codebases they examined [11, Table 
1]. So long as they are brought back in-bounds before be- 
ing used to access memory, many experts believe this will 
work; our survey (Question [9/15]) gave: yes: 230 (73%), 
only sometimes: 43 (13%), no: 13 (4%), don’t know: 27 
(8%). On the other side, the textual comments reveal that 
some compiler developers believe that compilers may
optimise assuming that this does not occur, and there can be
issues with overflow in pointer arithmetic and with large al- 
locations. 

This is another clear conflict. For our formal model, as 
in CHERI, we tentatively choose to permit arbitrary pointer 
arithmetic: out-of-bounds pointers can be constructed, with 
undefined behaviour occurring if an access-time check 
w.r.t. the bounds associated to their provenance fails. 
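
This choice can be illustrated with a small model in which pointer arithmetic never fails and only accesses are checked. It is a sketch under stated assumptions: addresses are modelled as uintptr_t values that wrap on overflow, and the names allocation, abs_ptr, ptr_add and access_ok are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types, for illustration only. */
struct allocation { unsigned id; uintptr_t base; uintptr_t size; };
struct abs_ptr    { const struct allocation *prov; uintptr_t addr; };

/* Arbitrary pointer arithmetic: the address moves (wrapping modulo the
   address-space size), the provenance is carried along unchanged. */
static struct abs_ptr ptr_add(struct abs_ptr p, intptr_t delta_bytes)
{
    p.addr += (uintptr_t)delta_bytes;
    return p;
}

/* Access-time check against the bounds of the pointer's provenance. */
static bool access_ok(struct abs_ptr p, uintptr_t n_bytes)
{
    return p.prov != NULL
        && n_bytes <= p.prov->size
        && p.addr >= p.prov->base
        && p.addr - p.prov->base <= p.prov->size - n_bytes;
}
```

A pointer moved four bytes below its allocation fails access_ok, but adding eight bytes brings it back in bounds and the access succeeds, which matches the transient out-of-bounds usage described above.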

2.3 Pointer Copying 

In a provenance semantics, as C lets one operate on the rep- 
resentation bytes of values, we have to ask when those opera- 
tions make usable copies of pointers, preserving the original 
provenance. The library memcpy must clearly allow this, but 
what about user code that copies the representation bytes, 
perhaps with more elaborate computation on the way (Q13- 
Q16)? Most survey respondents expect this to work (Ques- 
tion [5/15]): yes: 216 (68%), only sometimes: 50 (15%), 
no: 18 (5%), don’t know: 24 (7%). They give some interesting
examples, e.g. “Windows /GS stack cookies do this
all the time to protect the return address. The return address
is encrypted on the stack, and decrypted as part of
the function epilogue” and a “JIT that stores 64-bit virtual
ptrs as their hardware based 48-bits”. Our candidate formal
model should permit copying pointer values via indirect
dataflow, as the representation bytes or bits carry the origi- 
nal provenance, combining values with the same provenance 
preserves that provenance, and the access-time check com- 
pares the recalculated address with that of the original allo- 
cation. It will not permit copying via indirect control flow 
(e.g. making a branch based on each bit of a pointer value), 
and it intentionally does not require all of the original bits to 
flow to the result. We view this as a plausible de facto seman- 
tics, but more work is needed to see if it is really compatible 
with current compiler analysis implementations. 
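
A user-code copy with computation on the way can be sketched as below, loosely in the spirit of the stack-cookie example. It is an illustration only: copy_bytes_xor and copy_pointer_via_bytes are hypothetical names, and the single-byte XOR key stands in for whatever transformation real code applies.

```c
#include <assert.h>
#include <stddef.h>

/* Copy n representation bytes, XOR-ing each with key on the way. */
static void copy_bytes_xor(void *dst, const void *src, size_t n,
                           unsigned char key)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = (unsigned char)(s[i] ^ key);
}

/* Round-trip a pointer through a scrambled byte buffer. Every byte of the
   result is computed by dataflow from the bytes of p, never by branching
   on them, so a dataflow-based provenance model keeps p's provenance. */
static int *copy_pointer_via_bytes(int *p, unsigned char key)
{
    unsigned char scrambled[sizeof p];
    int *q;
    assert(sizeof scrambled == sizeof q);
    copy_bytes_xor(scrambled, &p, sizeof p, key); /* scramble */
    copy_bytes_xor(&q, scrambled, sizeof q, key); /* unscramble */
    return q;
}
```

The result compares equal to the original pointer and can be used to access the object on mainstream implementations; whether that use is defined is exactly what Q13-Q16 ask.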

2.4 Unspecified Values 

C does not require variables and memory to be initialised. 
Reading an uninitialised variable or struct member (either 
due to a bug or intentionally, to copy, output, hash, or set 
some bits of a partially initialised value), has several possi- 
ble semantics. Our survey (Question [2/15]) gave bimodal 
answers, split between (1) and (4): 

1. undefined behaviour (meaning that the compiler is free 
to arbitrarily miscompile the program, with or without a 
warning): 139 (43%) 

2. going to make the result of any expression involving that 
value unpredictable: 42 (13%) 

3. going to give an arbitrary and unstable value (maybe with 
a different value if you read again): 21 (6%) 

4. going to give an arbitrary but stable value (with the same 
value if you read again): 112 (35%) 


In the text responses, the only real use cases seem to be copying
a partially initialised struct and (more rarely) comparing
against one. It appears that current Clang, GCC, and MSVC 
are not exploiting the licence of (1), though one respondent 
said Clang is moving towards it, and one that (1) may be 
required for Itanium. Another makes a strong argument that 
the MSVC behaviour is more desirable for security reasons. 
But GCC and Clang do perform SSA transformations that 
make uninitialised values unstable (2); our test cases exhibit 
this for Clang. 

The ISO standard introduces indeterminate values, which
are either an unspecified value (a “valid value [where ISO]
imposes no requirements on which value is chosen in any
instance”) or a trap representation; reading uninitialised
values can give undefined behaviour either if the type being
read has trap representations in this implementation, or
(6.3.2.1p2) if the object has not had its address taken. These
definitions have given rise to much confusion over the years,
but it seems clear that for current mainstream C, there are no
trap representations at most types, perhaps excepting _Bool,
floating-point, and some mainframe pointer types, and the
6.3.2.1p2 clause was intended to cover the Itanium case.
Leaving those cases apart, (2) seems reasonable from the 
compiler point of view, in tension with (4), that may be relied 
upon by some code. 
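
As a concrete instance of the copying use case reported in the survey, the following minimal sketch copies a partially initialised struct bytewise and then reads back only the member that was written. The names struct pair and copy_partially_initialised are ours, chosen purely for illustration, and the sketch assumes (as suggested above for mainstream implementations) that unsigned char has no trap representations.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical example type: only `tag` is ever written below. */
struct pair {
    int tag;
    int payload;
};

/* Copies a partially initialised struct and returns the initialised member.
 * memcpy accesses the object representation as unsigned char, so the
 * unwritten bytes of `payload` are copied but never interpreted as an int. */
int copy_partially_initialised(int tag)
{
    struct pair src;
    struct pair dst;

    src.tag = tag;                  /* src.payload is deliberately left unwritten */
    memcpy(&dst, &src, sizeof dst); /* bytewise copy, including unspecified bytes */

    assert(dst.tag == src.tag);     /* the written member survives the copy */
    return dst.tag;                 /* reading dst.payload instead would raise the question above */
}
```

Reading dst.payload after the copy is exactly the situation on which semantics (1) to (4) disagree, so the sketch avoids it.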

2.5 Unspecified Padding Values 

Unspecified values arise also in padding bytes, which in C 
can be inspected and mutated via char * pointers. Here we 
see several possible semantics, including: 

1. Padding bytes are regarded as always holding unspecified
values, irrespective of any byte writes to them (so the 
compiler could arbitrarily write to padding at any point). 

2. Structure member writes are deemed to also write un- 
specified values over subsequent padding. 

3. ...or to nondeterministically either write zeros over sub- 
sequent padding or leave it unchanged. 

4. Structure copies might copy padding, but structure mem- 
ber writes never touch padding. 

Our survey (Question [1/15]) produced mixed results: some- 
times it is necessary to provide a mechanism that program- 
mers can use to ensure that no security-relevant information 
is leaked via padding, e.g. via (3) or (4); those also let users 
maintain the property that padding is zeroed, enabling
deterministic bytewise CAS, comparison, marshalling, etc. An
MSVC respondent suggested it provides (4), and a Clang 
respondent that it does not require (1) or (2). But a GCC re- 
spondent suggested a plausible optimisation, scalar replace- 
ment of aggregates, which could require (1) or (2) to make 
the existing compiler behaviour admissible, and we under- 
stand that an IBM mainframe compiler may by default also 
require those. This is a real conflict. 
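
To make the conflict tangible, here is a small sketch of code whose bytewise behaviour depends on the choice among (1) to (4), together with a member-wise comparison that does not. The type struct item and all function names are hypothetical; whether struct item actually contains padding is ABI-dependent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical example type; on many ABIs there is padding after `c`. */
struct item {
    char c;
    int  i;
};

/* Number of bytes of struct item not occupied by any member. */
size_t item_padding_bytes(void)
{
    assert(sizeof(struct item) >= sizeof(char) + sizeof(int));
    return sizeof(struct item) - sizeof(char) - sizeof(int);
}

/* Member-wise equality: never reads padding, so its result is the same
 * under all four candidate semantics. */
bool item_equal(const struct item *a, const struct item *b)
{
    return a->c == b->c && a->i == b->i;
}

/* Builds two items over differently filled storage and compares them. */
bool compare_items_with_different_fill(void)
{
    struct item a, b;

    memset(&a, 0x00, sizeof a);   /* any padding bytes start as 0x00 */
    memset(&b, 0xFF, sizeof b);   /* any padding bytes start as 0xFF */
    a.c = 'x'; a.i = 42;
    b.c = 'x'; b.i = 42;

    /* memcmp(&a, &b, sizeof a) may or may not return 0 here, depending on
     * the padding semantics in force; the member-wise test is unaffected. */
    return item_equal(&a, &b);
}
```

Code that needs deterministic bytewise comparison or hashing cannot take this route and so depends on something like (3) or (4).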



2.6 Effective Types 

C99 introduced effective types to permit compilers to do 
optimisations driven by type-based alias analysis (TBAA), 
ruling out programs involving unannotated aliasing of ref- 
erences to different types by regarding them as having 
undefined behaviour. This is one of the less clear, less 
well-understood, and more controversial aspects of C. The 
effective-types question of our preliminary survey was the 
only one which received a unanimous response: “don’t 
know”. Here again we find conflicts, for example for: 

Q75 Can an unsigned character array with static or au- 
tomatic storage duration be used (in the same way as a 
malloc’d region) to hold values of other types? 

In our survey (Question [11/15]), 243 (76%) say this will 
work, and 201 (65%) know of real code that relies on it, 
but a strict reading of the ISO standard disallows it, and a 
GCC contributor noted “No, this is not safe (if it’s visible to
the compiler that the memory in question has unsigned char
as its declared type)”. However, our candidate formal model
focusses on the C used by systems code, often compiled with
-fno-strict-aliasing to turn TBAA off, and so should
permit this and related idioms.
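
The situation asked about in Q75 can be sketched as follows. The function name is ours; the code itself only uses the memcpy route, which is unproblematic under ISO, and mentions the contested direct access in a comment rather than executing it.

```c
#include <assert.h>
#include <string.h>

/* Stores `value` in a static unsigned char array and reads it back.
 * The array's declared type is unsigned char, as in Q75. */
int roundtrip_through_char_array(int value)
{
    static _Alignas(int) unsigned char storage[sizeof(int)];
    int out;

    /* ISO-conforming route: copy the object representation in and out. */
    memcpy(storage, &value, sizeof value);
    memcpy(&out, storage, sizeof out);

    /* The contested idiom would instead access the array through an int
     * lvalue, e.g. *(int *)storage = value; a strict reading of the
     * effective-type rules disallows that, while code built with
     * -fno-strict-aliasing commonly relies on it. */
    assert(out == value);
    return out;
}
```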

3. Memory Semantics of C Analysis Tools 

As an initial experimental investigation into this, we ran our 
tests with Clang’s memory, address, and undefined-behavior 
sanitisers (MSan, ASan, and UBSan), which are intended to 
identify uses of uninitialized memory values, invalid memory
accesses, and other undefined behaviors; the TrustInSoft
tis-interpreter [44] (based on the Frama-C value analysis
[9]); and Hathhorn et al.’s KCC [16, 18, 20]. We discussed
the tis-interpreter results briefly with one of its
authors. These three groups of tools gave radically different
results. 

For the Clang sanitisers, we were surprised at how few 
of our tests triggered warnings. All 13 of our structure- 
padding tests and 9 of our other unspecified-value tests ran 
without any sanitiser warnings (including tests that triggered 
compile-time warnings). For example, Q49 passes an un- 
specified value to a library function, yet does not trigger a 
report, though MSan does detect a flow-control choice orig- 
inating from an unspecified value in Q50. MSan flagged two 
of the unspecified value tests (though only at -O0). ASan and
MSan did report errors on the two tests that rely on treat- 
ing an arbitrary integer value as a pointer (though these also 
caused segmentation faults without the sanitisers enabled), 
but neither flagged any of our other pointer provenance tests 
as dubious. This might be due to deliberate design choices to 
adopt a liberal semantics to accommodate the de facto stan- 
dards, or to limitations in the tools, or both. 

tis-interpreter aims for a tight semantics, to detect 
enough cases to ensure “that any program that executes cor- 
rectly in tis-interpreter behaves deterministically when
compiled and produces the same results”. In many places
it follows a much stricter notion of C than our candidate 
de facto model, e.g. flagging most of the unspecified-value 
tests, and not permitting comparison of pointer representa- 
tions; in some others it coincides with our candidate de facto 
model but differs from ISO, e.g. assuming null pointer rep- 
resentations are zero. Our tests also identified two bugs in 
the tool, acknowledged and fixed by the developers. 

KCC detected two potential alignment errors in earlier 
versions of our tests. But it gave ‘Execution failed’, with no 
further details, for the tests of 20 of our questions; ‘Transla- 
tion failed’ for one; segfaulted at runtime for one; and gave 
results contrary to our reading of the ISO standard for at 
least 6: it exhibited a very strict semantics for reading unini- 
tialised values (but not for padding bytes), and permitted 
some tests that ISO effective types forbid. 


4. Memory Semantics of CHERI C 

A significant amount of research has focused on memory- 
safe implementations of C, e.g. [14, 15, 22, 34, 35], includ- 
ing commercial implementations aimed at mass use, such as 
Intel’s MPX [21]. To date, no work in this area has clearly 
defined the interpretation of C that it aims to implement. 

The CHERI processor [54] extends existing instruction 
sets to support spatial memory safety. We have run our 
tests on the CHERI C implementation (Clang-based on an 
FPGA soft-core CPU), with a view to specialising our can- 
didate Cerberus de facto model to precisely characterise 
their intended semantics. We found several areas where the 
current CHERI implementation deviates from the expected 
behaviour. Some were known, e.g. correct provenance on 
pointers to globals requires linker support which is not yet 
completed. Others were surprising. For example, the CHERI 
pointer equality had two pointers with different provenance 
compare equal, but not be interchangeable. This was ad- 
dressed by the CHERI developers adding a new compare- 
exactly-equal to their instruction set to compare pointers 
by both their address and their metadata, to use for pointer 
comparison. A more subtle issue was discovered in a test 
where (i & 3u) == 0u (where i is a uintptr_t) evaluated
to false, even though the low three bits in i are all zero. This
was caused by the result of (i & 3u) being the fat pointer i
with its offset set to the offset of i anded with 3, which gives 
a non-zero value. This case is particularly interesting as it 
is triggered by an assertion in the test: the underlying idiom 
does work on CHERI, but defensively written code will fail. 
We are still working with the CHERI C developers to de- 
termine the best solution to this issue. We also helped codify 
the CHERI C constraints in a number of places. For example, 
its non-intptr_t integer values do not carry pointer provenance,
and provenance in arithmetic expressions is only inherited
from the left-hand side.
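
For reference, the alignment idiom in question looks roughly as follows. This is a sketch with hypothetical names, and it assumes a conventional implementation in which converting a pointer to uintptr_t yields its plain address; that is precisely the assumption that did not hold for the capability-based representation described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the integer value of p has its low two bits clear.
 * Assumption: (uintptr_t)p is the plain address of p, as on mainstream
 * flat-address-space implementations. */
bool low_bits_clear(const void *p)
{
    uintptr_t i = (uintptr_t)p;
    return (i & 3u) == 0u;
}

/* Exercises the check on a 4-byte-aligned buffer and on an interior byte. */
bool alignment_idiom_holds(void)
{
    static _Alignas(4) unsigned char buf[8];

    assert(low_bits_clear(buf));      /* the kind of defensive check that failed */
    return !low_bits_clear(buf + 1);  /* an odd address must be rejected */
}
```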

Figure 1. Cerberus architecture (with LOS counts)

5. Cerberus 

We now turn from the specifics of pointer and memory be- 
haviour to our broader Cerberus semantics for a substantial 
fragment of C. 

5.1 Architecture: Typed Elaboration into Core 

The dynamic semantics of programming languages are often 
defined as operational semantics over abstract syntax types 
(ASTs) relatively close to the source language. Some pre- 
vious work formalising large fragments of the C standard 
does this, e.g. Papaspyrou [40] and Ellison and Rosu [16], 
but it can lead to an unnecessarily monolithic semantics. 
Many of the dynamic intricacies of C relate to essentially 
compile-time phenomena, e.g. the pervasive implicit coer- 
cions for integer expressions and the loose specification of 
expression evaluation order. We manage complexity by fac- 
toring our semantics accordingly, into an elaboration that 
translates an explicitly typed source-like AST into a simpler 
Core language, the operational semantics of that, and either 
the memory object or concurrency model. Refinements to 
the semantics, to match some particular de facto standard, 
should largely affect only the first and last, leaving Core un- 
changed. Our elaboration function is total, and designed to 
produce well-typed Core programs. Both it and our Core op- 
erational semantics are as compositional as we can arrange: 
elaboration is inductive on the Typed AST program structure 
except that we precompute the lists of labels of C case state- 
ments; the Core operational semantics is a structural opera- 
tional semantics over a state that essentially has just a Core 
expression and stack of continuations. By selecting an ap- 
propriate sequencing monad implementation, we can select 
whether to perform an exhaustive search for all allowed ex- 
ecutions or pseudorandomly explore single execution paths. 

The architecture of Cerberus is shown in Fig. 1. After 
conventional C preprocessing, the front end starts with a 
Menhir [41] parser producing values of an AST type Cabs, 
closely following the ISO grammar. The Cabs_to_Ail desugaring
pass produces a more convenient AST, Ail. This handles
many intricate aspects that might be omitted in a small
calculus but have to be considered for real C, simplifying 
our later typechecking and elaboration. It covers: identi- 
fier scoping (linkage, storage classes, namespaces, identifier 
kinds); function prototypes and function definitions (includ- 
ing hiding, mutual recursion, etc.); normalisation of syntac- 
tic C types into canonical forms; string literals (which are 
implicitly allocated objects); enums (replacing them by in- 
tegers and adding type annotations); and desugaring for- 
and do-while loops into while. Where possible, it is struc- 
tured closely following the standard, and if it fails due to 
an ill-formed program, it identifies exactly what part of the 
standard is violated. Type checking adds explicit type an- 
notations, and likewise identifies the relevant parts of the 
standard on any failure. The Cabs_to_Ail and typechecking 
passes operate without requiring any commitment to how C- 
standard implementation-defined choices are resolved. The 
main elaboration produces Core AST from Typed Ail. After 
optional Core-to-Core simplification, a driver combines the 
Core thread-local operational semantics with either our can- 
didate de facto standard sequential memory object model or 
the C11 operational concurrency model.

All of this except the parser is formally specified in 
Lem [33] in a pure functional monadic style, to give it 
a clear computational intuition and permit straightforward 
code generation for our executable tool. It comprises around 
19 000 non-comment lines of specification (LOS), plus 2600 
lines of parser. 

5.2 Core Overview 

Core is intended to be as minimal as possible while remain- 
ing a suitable target for the elaboration, and with the be- 
haviour of Core programs made as explicit as possible. It is a 
typed call-by-value language of function definitions and ex- 
pressions, with first-order recursive functions, lists, tuples, 
booleans, mathematical integers, a type of the values of C 
pointers, and a type of C function designators, for function 
pointers. Core also includes a type ctype of first-class val- 
ues representing C type AST terms, and various tests on it, to 
let Core code perform computations based on those, e.g. on 
whether a C type annotation in the Typed Ail source of the 
elaboration is a signed or unsigned integer type. The syntax 
of Core is in Fig. 2. 

The Core type system maintains a simple distinction be- 
tween pure and effectful expressions, e.g. allowing only pure 
expressions in the Core if test, and the elaboration maps as 
much as possible into the pure part, to ease reasoning. 

C variables are mutable but Core just has identifiers that 
are bound to their values when introduced. Interaction with 
the memory object and concurrency models is factored via 
primitive Core memory actions (a) for static and dynamic 
object creation, kill, load, store, and read-modify-write. 

To capture the intricacies of C runtime dynamics com- 
positionally (using structural operational semantics), we add 



oTy ::=                                    types for C objects
  | integer
  | floating
  | pointer
  | cfunction
  | array (oTy)
  | struct tag
  | union tag

bTy ::=                                    Core base types
  | unit                                   unit
  | boolean                                boolean
  | ctype                                  Core type of C type exprs
  | [bTy]                                  list
  | (bTy_1, .., bTy_n)                     tuple
  | oTy                                    C object value
  | loaded oTy                             oTy or unspecified value

coreTy ::=                                 Core types
  | bTy                                    pure base type
  | eff bTy                                effectful base type

object_value ::=                           C object values
  | intval                                 integer value
  | floatval                               floating-point value
  | ptrval                                 pointer value
  | name                                   C function pointer
  | array (object_value_1, .., object_value_n)                        C array value
  | (struct tag){ .member_1 = memval_1, .., .member_n = memval_n }    C struct value
  | (union tag){ .member = memval }                                   C union value

value ::=                                  Core values
  | object_value                           C object value
  | Specified (object_value)               non-unspecified loaded value
  | Unspecified (ctype)                    unspecified loaded value
  | Unit                                   unit
  | True                                   true
  | False                                  false
  | ctype                                  C type expr as value
  | bTy [value_1, .., value_n]             list
  | (value_1, .., value_n)                 tuple

ptrop ::=                                  pointer operations involving the memory state
  | pointer-equality-operator              pointer equality comparison
  | pointer-relational-operator            pointer relational comparison
  | ptrdiff                                pointer subtraction
  | intFromPtr                             cast of pointer value to integer value
  | ptrFromInt                             cast of integer value to pointer value
  | ptrValidForDeref                       dereferencing validity predicate

a ::=                                      memory actions
  | create (pe_1, pe_2)
  | alloc (pe_1, pe_2)
  | kill (pe)
  | store (pe_1, pe_2, pe_3, memory-order)
  | load (pe_1, pe_2, memory-order)
  | rmw (pe_1, pe_2, pe_3, pe_4, memory-order_1, memory-order_2)

pa ::=                                     memory actions with polarity
  | a                                      positive, sequenced by both let weak and let strong
  | neg (a)                                negative, only sequenced by let strong

pat ::=
  | _                                      wildcard pattern
  | ident                                  identifier pattern
  | ctor (pat_1, .., pat_n)                constructor pattern

pe ::=                                     Core pure expressions
  | ident                                  Core identifier
  | <impl-const>                           implementation-defined constant
  | value                                  value
  | undef (ub-name)                        undefined behaviour
  | error (string, pe)                     impl-defined static error
  | ctor (pe_1, .., pe_n)                  constructor application
  | case pe with | pat_1 => pe_1 | .. | pat_n => pe_n end             pattern matching
  | array_shift (pe_1, ctype, pe_2)        pointer array shift
  | member_shift (pe, tag.member)          pointer struct/union member shift
  | not (pe)                               boolean not
  | pe_1 binop pe_2                        binary operators
  | (struct tag){ .member_1 = pe_1, .., .member_n = pe_n }            C struct expression
  | (union tag){ .member = pe }            C union expression
  | name (pe_1, .., pe_n)                  pure Core function call
  | let pat = pe_1 in pe_2                 pure Core let
  | if pe then pe_1 else pe_2              pure Core if
  | is_scalar (pe)
  | is_integer (pe)
  | is_signed (pe)
  | is_unsigned (pe)

e ::=                                      Core expressions
  | pure (pe)                              pure expression
  | ptrop (ptrop, pe_1, .., pe_n)          pointer op involving memory
  | pa                                     memory action
  | case pe with | pat_1 => e_1 | .. | pat_n => e_n end               pattern matching
  | let pat = pe in e                      Core let
  | if pe then e_1 else e_2                Core if
  | skip                                   skip
  | pcall (pe, pe_1, .., pe_n)             Core procedure call
  | return (pe)                            Core procedure return
  | unseq (e_1, .., e_n)                   unsequenced expressions
  | let weak pat = e_1 in e_2              weak sequencing
  | let strong pat = e_1 in e_2            strong sequencing
  | let atomic (sym : oTy) = a_1 in pa_2   atomic sequencing
  | indet [n] (e)                          indeterminately sequenced expr
  | bound [n] (e)                          ...and boundary
  | nd (e_1, .., e_n)                      nondeterministic sequencing
  | save label (ident_1 : ctype_1, .., ident_n : ctype_n) in e        save label
  | run label (ident_1 := pe_1, .., ident_n := pe_n)                  run from label
  | par (e_1, .., e_n)                     cppmem thread creation
  | wait (thread-id)                       wait for thread termination

definition ::=                             Core definitions
  | fun name (ident_1 : bTy_1, .., ident_n : bTy_n) : bTy := pe       Core function definition
  | proc name (ident_1 : bTy_1, .., ident_n : bTy_n) : eff bTy := e   Core procedure definition

The result of elaborating a C program is a set of Core declarations to- 
gether with the name of the startup (main) function; a set of struct and 
union type definitions; a set of names, core types, and allocation/initialisa- 
tion expressions for C objects with static storage duration; the definitions of 
implementation-defined constants (some of which are Core functions); and 
a library of Core utility functions and procedures used by the elaboration. 


Here tag, member, memory-order, pointer-equality-operator, and pointer-relational-operator are as in the C syntax,
ctype ranges over representations of C type expressions, label ranges over C labels, and ub-name ranges over identifiers for C
undefined behaviours. The ident are Core identifiers, and name ranges over C function names, Core function/procedure names,
and implementation-defined constant names <impl-const>. The binop and ctor range over Core binary operations and value
constructors (corresponding to the Core value productions); n and thread-id are natural numbers. Finally, intval, floatval,
ptrval, and memval are the representations of values from the memory layout model, containing provenance information as 
appropriate and symbolically recording how they are constructed (these are opaque as far as the rest of Core is concerned). The 
figure shows the concrete syntax for Core used by the tool; it is mechanically typeset from an Ott grammar that also generates 
the Lem types used in the model for the Core abstract syntax. We have elided Core type annotations in various places, C source 
location annotations, and constructors for values carrying constraints. 


Figure 2. Core syntax 



6.5.7 Bitwise shift operators

Syntax

1 shift-expression:
      additive-expression
      shift-expression << additive-expression
      shift-expression >> additive-expression

Constraints

2 Each of the operands shall have integer type.

Semantics

3 The integer promotions are performed on each of the operands.
  The type of the result is that of the promoted left operand. If
  the value of the right operand is negative or is greater than or
  equal to the width of the promoted left operand, the behavior is
  undefined.

4 The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated
  bits are filled with zeros. If E1 has an unsigned type,
  the value of the result is E1 × 2^E2, reduced modulo one more
  than the maximum value representable in the result type. If E1
  has a signed type and nonnegative value, and E1 × 2^E2 is
  representable in the result type, then that is the resulting value;
  otherwise, the behavior is undefined.

5 ... similarly for E1 >> E2 ...


[[e1 << e2]] =
  sym_e1   := E.fresh_symbol;   sym_e2   := E.fresh_symbol;
  sym_obj1 := E.fresh_symbol;   sym_obj2 := E.fresh_symbol;
  sym_prm1 := E.fresh_symbol;   sym_prm2 := E.fresh_symbol;
  sym_res  := E.fresh_symbol;
  core_e1  := [[e1]];           core_e2  := [[e2]];
  E.return(
    let weak (sym_e1, sym_e2) = unseq(core_e1, core_e2) in
    pure(
      case (sym_e1, sym_e2) with
      | (_, Unspecified(_)) =>
          undef(Exceptional_condition)
      | (Unspecified(_), _) =>
          (IF is_unsigned_integer_type(ctype_of e1) THEN
             Unspecified(result_ty)
           ELSE
             undef(Exceptional_condition))
      | (Specified(sym_obj1), Specified(sym_obj2)) =>
          let sym_prm1 =
            integer_promotion(ctype_of e1) sym_obj1 in
          let sym_prm2 =
            integer_promotion(ctype_of e2) sym_obj2 in
          if sym_prm2 < 0 then
            undef(Negative_shift)
          else if ctype_width(result_ty) <= sym_prm2 then
            undef(Shift_too_large)
          else
            (IF is_unsigned_integer_type(ctype_of e1) THEN
               Specified(sym_prm1*(2^sym_prm2)
                         rem_t (Ivmax(result_ty)+1))
             ELSE
               if sym_prm1 < 0 then
                 undef(Exceptional_condition)
               else
                 let sym_res = sym_prm1*(2^sym_prm2) in
                 if is_representable(sym_res, result_ty) then
                   Specified(sym_res)
                 else
                   undef(Exceptional_condition))))


The elaboration [[·]] is a Lem function that calculates the Core expression that a C expression or statement elaborates to. Here
we show the definition for C left-shift expressions e1 << e2, described in Section 6.5.7 of the ISO C11 standard on the left.
The elaboration, on the right, starts by constructing fresh Core identifiers and recursively calculating the elaborations of e1 and
e2, in a Lem monad for fresh identifiers. In the main body, the IF ... THEN ... ELSE ... are Lem conditionals, executed at
elaboration time, while the lower-case blue keywords are parts of the calculated Core expression, constructors of the Lem types
for the Core AST that are executed at runtime by the Core operational semantics. The former make use of two Lem functions:
ctype_of and is_unsigned_integer_type. C expressions are effectful and so the calculated elaborations of the two operands
(core_e1 and core_e2) are impure Core expressions that need to be sequenced in some way. This is specified elsewhere in the
standard, not in 6.5.7: Clause 6.5p1 says “value computations of the operands of an operator are sequenced before the value
computation of the result of the operator”, modelled with the let weak; and Clause 6.5p2 states that “side effects and value
computations of subexpressions are unsequenced”, captured by the unseq. The remainder of the elaboration only performs 
type conversions and arithmetic calculations and so is a pure Core expression. The case expresses our chosen de facto answers 
to Q43 and Q52: unspecified values are considered daemonically for identification of possible undefined behaviours and are 
propagated through arithmetic; the rest captures the ISO left-shift text point-by-point, as shown by the arrows. Clause 6.5.7p2 
is captured in our typechecker, not in the elaboration (result_ty is the appropriate promoted C type). The calculated Core 
contains some Core function calls to auxiliaries: integer_promotion, ctype_width, and is_representable. We elide details 
in the Lem development of the formation of these calls and the construction of Core type annotations. 


Figure 3. Sample extract of C11 standard and the corresponding clause of the elaboration function


novel constructs to express the C evaluation order, a goto an- 
notated with information about the C block boundary traver- 
sals involved, and nondeterministic choice and parallelism; 
all these are explained below (these last three are also con- 
sidered to be effectful). 

5.3 A Sample Excerpt of the Elaboration 

In Fig. 3 we show a sample excerpt from the C11 standard,
for left-shift expressions, and the corresponding clause of
our elaboration function, mapping Typed Ail left-shift expressions
into Core. It uses our constructs for weak sequencing
(let weak), unsequencing (unseq()), and undefined behaviour
(undef()). These are explained below, but one can
already see that the elaboration and the standard text are 
close enough to relate one to the other — as one would hope, 
as this is one of the clearer parts of the standard. Making the 
whole of Cerberus accessible to a motivated but practical 
audience will require further work on presentation, but we 
believe this gives some grounds for optimism. 
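
To relate the elaborated checks back to ordinary C, the sketch below performs the same sequence of tests for the signed int case as a runtime-checked helper. It is not part of Cerberus: the names shl_status, INT_BITS, and checked_shl_int are ours, and INT_BITS assumes that int has no padding bits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Outcomes named after the undefined behaviours in the elaboration. */
enum shl_status {
    SHL_OK,
    SHL_NEGATIVE_SHIFT,
    SHL_SHIFT_TOO_LARGE,
    SHL_EXCEPTIONAL_CONDITION
};

/* Width of int in bits; assumes int has no padding bits. */
#define INT_BITS ((int)(sizeof(int) * CHAR_BIT))

/* Left shift on int that tests each undefined case of 6.5.7 before shifting. */
enum shl_status checked_shl_int(int e1, int e2, int *result)
{
    assert(result != NULL);

    if (e2 < 0)
        return SHL_NEGATIVE_SHIFT;          /* 6.5.7p3: negative right operand */
    if (e2 >= INT_BITS)
        return SHL_SHIFT_TOO_LARGE;         /* 6.5.7p3: shift >= width */
    if (e1 < 0)
        return SHL_EXCEPTIONAL_CONDITION;   /* 6.5.7p4: negative signed left operand */
    if (e1 > (INT_MAX >> e2))
        return SHL_EXCEPTIONAL_CONDITION;   /* 6.5.7p4: e1 * 2^e2 not representable */

    *result = e1 << e2;                     /* all undefined cases excluded */
    return SHL_OK;
}
```

The representability test compares against INT_MAX >> e2 rather than computing e1 * 2^e2, because C lacks the mathematical integers that Core computes over.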

5.4 Undefined Behaviour 

Undefined behaviour can arise dynamically in two ways: 
where a primitive C arithmetic operation has undefined be- 
haviour for some argument values, and from memory ac- 
cesses (unsafe memory accesses, unsequenced races, and 
data races). For the former, our elaboration simply intro- 
duces an explicit test into the generated Core code, as in the 
several uses of undef() in Fig. 3. If the Core operational
semantics reaches one of these, it terminates execution and
reports which undefined behaviour has been encountered (together
with the C source location). This is analogous to the insertion
of runtime checks for particular undefined behaviours
during compilation, as done by many tools, except that (a) 
it is more closely tied to the standard, and (b) in Cerberus’s 
exhaustive mode, it can detect undefined behaviours on any 
allowed execution path, not just those of a particular com- 
pilation. The latter are detected by the memory object or 
concurrency models, using calculated sequenced-before and 
happens-before relations over actions. 
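
The first kind of check has a direct analogue in hand-written C. As an illustration only (the helper name is ours, and the INT_MIN case assumes a two's-complement int), signed division can be guarded by explicit tests for its two undefined cases, much as the elaboration guards each arithmetic primitive:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Signed division with the undefined cases tested explicitly:
 * division by zero, and INT_MIN / -1 (quotient not representable,
 * assuming two's complement). Returns false instead of dividing. */
bool checked_div_int(int a, int b, int *quotient)
{
    assert(quotient != NULL);

    if (b == 0)
        return false;                 /* division by zero */
    if (a == INT_MIN && b == -1)
        return false;                 /* result would exceed INT_MAX */

    *quotient = a / b;
    return true;
}
```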

5.5 Arithmetic 

The many C integer types (char, short, int, int32_t, etc.) 
cause frequent confusion and compiler errors [42], with their 
various finite ranges of values, signed and unsigned variants, 
different representation sizes and alignment constraints, and 
implicit conversions between integer types. For example, the 
C expression -1 < (unsigned int)0, perhaps surprisingly,
can evaluate to 0 (false) (and typically does on x86-64). All 
this is captured in Cerberus by the elaboration, case-splitting 
over the inferred types as necessary, with Core computation 
simply over the mathematical integers. 
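
The comparison example can be reproduced directly. In the sketch below (function names ours), c_less applies the implicit conversion performed by C, while math_less compares the operands as mathematical integers, which is the view Core takes once the elaboration has made the conversions explicit.

```c
#include <assert.h>
#include <stdbool.h>

/* C semantics: the int operand is converted to unsigned int before comparing. */
int c_less(int a, unsigned int b)
{
    return a < b;   /* for a == -1 this compares UINT_MAX with b */
}

/* Comparison of the same operands over the mathematical integers. */
bool math_less(int a, unsigned int b)
{
    return a < 0 || (unsigned int)a < b;
}

/* Contrasts the two readings on the operands from the text. */
void check_comparisons(void)
{
    assert(c_less(-1, 0u) == 0);   /* the C expression is false */
    assert(math_less(-1, 0u));     /* mathematically, -1 < 0 holds */
}
```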


5.6 Sequencing 

The C standard is quite precise when it comes to the eval- 
uation order of expressions, but it is a loose specification. 
For example, consider the C statement w = x++ + f(z, 2);.
Its sequencing semantics is shown on the right as a graph over
the memory actions (reads R, writes W, object creations C, and
object kills K); the solid arrows represent the sequenced-before
relation of the standard’s concurrency model; the double arrow
additionally links two actions into an atomic unit, and the dotted
lines indicate indeterminate sequencing. Here (1) the evaluations of
the operands of + are unsequenced with respect to each other;
(2) the read of x and the body of f() are sequenced before
the store to w; (3) the read and write of x are atomic, preventing
other memory actions occurring between them; (4)
for each argument of f(), first it is evaluated, then a temporary
object is created and the value of the argument written
to it; all that is before the body of f() but unsequenced with
respect to other arguments; and (5) after the body of f() the
temporary objects are killed before the return value is used
in the write to w. Finally, (6) the body of f() is indeterminately
sequenced with respect to everything with which it
would otherwise be unsequenced, namely the read and write
of x. This means that in any execution and in any occurrence
of the statement, the body of f() is sequenced-before
or sequenced-after (one way or the other) with respect to
each of those two accesses; as the latter are atomic, it must
be sequenced one way or the other with respect to both.
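
The statement discussed above can be run as ordinary C when f touches neither x nor w, in which case every allowed ordering yields the same result. The body chosen for f and the wrapper run_statement are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical body for f: it reads only its own parameters. */
static int f(int a, int b)
{
    return a * b;
}

/* Runs the statement from the text and reports the final x through *x_out. */
int run_statement(int x, int z, int *x_out)
{
    int w;

    assert(x_out != NULL);

    /* Read and increment of x (atomic pair), call of f, then the store to w.
     * The operands of + are unsequenced; the body of f is indeterminately
     * sequenced with respect to the accesses to x. */
    w = x++ + f(z, 2);

    *x_out = x;
    return w;
}
```

If f instead read or wrote x, the two indeterminate orderings of (6) could produce different values of w.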

To capture this cleanly, we introduce several novel se- 
quencing forms into the Core expression language. 

  | unseq (e_1, .., e_n)                     unsequenced expressions
  | let weak pat = e_1 in e_2                weak sequencing
  | let strong pat = e_1 in e_2              strong sequencing
  | let atomic (sym : oTy) = a_1 in pa_2     atomic sequencing
  | indet [n] (e)                            indeterminate sequencing
  | bound [n] (e)                            ...and boundary
  | nd (e_1, .., e_n)                        nondeterministic seq.

The first permits an arbitrary interleaving of the memory actions
of each of the e_i (subject to their internal sequenced-before
constraints) and reduces to a tuple of their values. The
next two bind the result of e_1 to the identifiers in the pattern
pat and to some extent sequence the memory actions of e_1
before those of e_2.

Consider the semantics of C assignments when combined 
with postfix increment and decrement. In specifying that 
evaluation of the right operand of an assignment occurs before
the assigning store [52, §6.5.16#3], the standard restricts
that ordering to the “value computation”. Intuitively,
this is the part of the expression evaluation that determines
the final value, and in the postfix increment/decrement case, the
“value computation” excludes the incrementing/decrementing
store. To capture this in Core, we annotate memory actions
with polarities; those that are not part of a value computation
are negative. The strong sequencing operator makes
all actions of e_1 sequenced-before all actions of e_2, while
the weak sequencing operator only makes the positive actions
of e_1 sequenced-before the actions of e_2. Atomic sequencing is
required only for the postfix increment and decrement operators:
the let atomic makes the first memory action a_1
sequenced before the second pa_2 and prevents indeterminate
sequencing putting other memory actions between them
(shown with a double arrow on the left above). Finally, to
calculate the possible sequenced-before orderings of any in- 
determinately sequenced function bodies in an expression, 
we have a construct to label subexpressions as indetermi- 
nately sequenced with respect to their context, and another to 
delimit the relevant part of that context (corresponding to the 
original C expression). A Core-to-Core transformation (ex- 
pressed as a rewrite system using those operators) converts 
any expression into a nondeterministic choice between Core 
expressions that embodies all the possible choices; indet 
and bound do not appear in the result. 

The point of all this expression-local nondeterminism 
is not to permit user-visible nondeterminism, but rather to 
permit optimisation. If two accesses to the same object are 
unrelated by the sequenced-before relation then they form an 
unsequenced race and the program has undefined behaviour 
(or, in other words, the compiler can assume that there are 
no two such accesses). Our semantics can detect all such 
unsequenced races. 
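
The distinction between indeterminate sequencing and an unsequenced race can be seen in a few lines of C. In this sketch (names ours), both updates happen inside function bodies, so they may run in either order but cannot interleave; the sum is the same either way.

```c
#include <assert.h>

static int counter;

/* Increments the shared counter; the store happens inside a function body. */
static int bump(void)
{
    return ++counter;
}

/* Two calls whose bodies are indeterminately sequenced: either order is
 * allowed, so there is no unsequenced race. By contrast, the expression
 * ++counter + ++counter contains two unsequenced stores to counter and
 * has undefined behaviour. */
int sum_of_two_bumps(void)
{
    counter = 0;
    int sum = bump() + bump();   /* 1 + 2 or 2 + 1 */
    assert(counter == 2);
    return sum;
}
```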

5.7 Lifetime 

Memory objects in C have lifetimes depending on their 
scope, storage class, and, for dynamically allocated objects, 
the calls to allocation and deallocation functions. A precise 
model of object lifetime is needed to detect illegal memory 
actions, including accessing objects outwith their lifetime 
and concurrency-related illegal actions such as data races. 

Lifetime is made explicit in our elaboration into Core: 
Core has no notion of block, and memory objects are all 
created and killed using the explicit memory actions introduced
by the elaboration. The Core create and alloc memory
actions take an alignment constraint and either a ctype
or allocation size respectively; they reduce to what corre- 
sponds to the C value of a pointer to the created object. The 
load, store, and rmw memory actions all take the ctype 
from the lvalue of the original C expression as their first ar- 
guments, together with values and addresses as appropriate. 
These are explicitly required to be sequenced, and then there 
are also ptrop pointer operations, which in some memory 
layout models need to access (or even mutate) the memory 
state, and so also must be sequenced. 
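
As a rough guide to how C source corresponds to these actions, the sketch below annotates two small functions with the Core memory action that each step would give rise to. The mapping in the comments is our informal reading of the description above, not output of the tool, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Round trip through a heap object. Returns -1 if allocation fails, so
 * callers are expected to pass a non-negative value. */
int heap_roundtrip(int value)
{
    assert(value >= 0);

    int *p = malloc(sizeof *p);   /* alloc: lifetime of the object begins */
    if (p == NULL)
        return -1;

    *p = value;                   /* store */
    int loaded = *p;              /* load */
    free(p);                      /* kill: lifetime ends */

    /* Any access through p from here on would be outside the lifetime. */
    return loaded;
}

/* The same pattern for a block-scoped object with automatic storage. */
int block_roundtrip(int value)
{
    int loaded;
    {
        int local = value;        /* create, then store, at block entry */
        loaded = local;           /* load */
    }                             /* kill at block exit */
    return loaded;
}
```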


5.8 Loops and Goto 

Core provides recursive functions and a pure if construct. 
C statements, including goto, could be elaborated using 
these alone, with a CPS transformation of the C program, 
but this would lead to a serious loss of the source code 
structure and/or excessive code duplication, making it hard 
to relate the Core evaluation back to the original C program. 
We instead equip Core with labels and a primitive goto-like 
construct, giving semantics to those using continuations in 
the Core dynamic semantics. 

This means that we can uniformly elaborate the various 
C control-flow constructs: the loops for, while, and do, 
the break and continue statements to prematurely exit a 
loop or skip an iteration, and the switch statement. All are 
elaborated using Core labels and the Core save and run, 
keeping the original code structure. For example, C of the 
form on the left below is elaborated into the Core on the 
right: 

    while (e) {          save l() in
      s1;                  let strong id = [[e]] in
      break;               if id = 0 then
      s2;                    save b() in skip
    };                     else
                             let strong _ = [[s1]] in
                             let strong _ = run b() in
                             let strong _ = [[s2]] in
                             run l()
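
The shape of this elaboration can be mimicked in C itself, using labels for the Core saves and gotos for the Core runs. The sketch below is only an analogy under that reading, not Core; s1 and s2 are instantiated with arbitrary statements, and all function names are hypothetical.

```c
#include <assert.h>

/* The C loop from the text, with s1 as "n += 1" and s2 as "n += 100". */
int with_while(int e)
{
    int n = 0;
    while (e) {
        n += 1;      /* s1 */
        break;
        n += 100;    /* s2: never reached */
    }
    return n;
}

/* The same control flow spelled with labels, mirroring save and run. */
int with_labels(int e)
{
    int n = 0;
l:                   /* save l() */
    if (e == 0) {
b:      ;            /* save b() in skip */
    } else {
        n += 1;      /* s1 */
        goto b;      /* run b(): the translation of break */
        n += 100;    /* s2: never reached */
        goto l;      /* run l(): start the next iteration */
    }
    return n;
}

/* Both spellings should agree on every input. */
void check_loops_agree(void)
{
    assert(with_while(0) == with_labels(0));
    assert(with_while(1) == with_labels(1));
}
```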

Handling the interaction between the C goto and the lifetime 
of block-scoped variables in C requires the Core goto to do 
more. If a C goto is used to jump into the middle of a block, the
local objects’ lifetimes start when that jump is performed,
as in the C goto l; { int x = 0; l: s1; }. Similarly,
jumping out of a block ends the lifetime of the local objects. 
The target of any ISO C goto is statically determined (we 
do not support GCC computed gotos), so our elaboration 
can record sufficient information to handle this: it records 
the set of visible objects as it goes down through blocks, 
then it annotates the Core saves and runs with the Core 
symbolic names and ctypes associated with those objects. 
The dynamics of the Core run does a create for all the 
objects which are in scope at the target save but not at the 
goto, and a kill for all the converse. 
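
A runnable instance of the goto l; { int x = 0; l: s1; } pattern is sketched below. The comments describe what we would expect the Core run to do according to the description above; the function name is hypothetical.

```c
#include <assert.h>

int jump_into_block(void)
{
    int result;
    goto l;            /* x is in scope at l but not here: the run creates x */
    {
        int x = 0;     /* skipped by the jump, so x is not initialised */
    l:
        x = 7;         /* write x before any read */
        assert(x == 7);
        result = x;
    }                  /* leaving the block ends the lifetime of x */
    return result;
}
```

The jump starts the lifetime of x without running its initialiser, which is why the example stores to x before reading it.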

5.9 Our Candidate De Facto Memory Object Model 

Expressing the de facto memory object model that we 
sketched above requires enriched definitions for pointer, in- 
teger, and memory values. Pointer values and integer values 
all contain a provenance, either empty (for the NULL pointer 
and pure integer values), the original allocation ID of the 
object the value was derived from, or a wildcard (for pointers
from IO). Most arithmetic involving one provenanced
value and one pure value preserves the provenance, while 
subtraction of two values produces a pure integer (to use
as an offset). Arithmetic involving two values with distinct
provenance also produces a pure integer, to ensure the result 
cannot be used between them (this prevents the per-CPU- 
variable case without additional annotation). Pointer values 
also include either a symbolic base (essentially an alloca- 
tion id) or a base cast from an integer value, together with a 
symbolic offset (to shift among struct members and arrays). 
Integer values can be concrete numbers, addresses of memory
objects, symbolic arithmetic (for offsets), casts of pointer
values to integers, pointer diffs, a byte of a memory value,
bytewise composites, sizeof, offsetof, and _Alignof, and
the maximum and minimum values of integer types. Mem- 
ory values can be either unspecified, an integer value of a 
given integer type, a pointer, or an array, union, or struct of 
memory values. 

We outline how this gives the desired behaviour for two 
of the questions from §2. For the basic provenance exam- 
ple of §2.1, it will construct fresh IDs at allocation time for 
the objects associated to y and x; the pointer values formed 
for &x and &y will have those provenances; and the
pointer-plus-pure-integer arithmetic &x+1 will preserve the x
provenance. The memcmp will examine only the pointer representation
bytes, not their provenances. Then one hits a tool design
choice: one could either make nondeterministic choices for 
the concrete addresses of allocations (at allocation-time or 
more lazily) or treat them symbolically and accumulate con- 
straints; both are useful. With the first, the memcmp can be 
concretely resolved to one of its three cases; with the sec- 
ond, one can make a nondeterministic choice between them 
and add the appropriate constraints (better for exhaustive 
calculation of all possible outcomes). In either semantics, 
following the “equals” branch, the semantics will check, at 
the *p=11 access, whether the address of the pointer value
(and width of the int) are within the footprint of the origi- 
nal allocation associated with the provenance of the pointer 
value. In this case it is not, so the semantics will flag an un- 
defined behaviour and the tool will terminate. 
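
The check described above can be sketched as a toy model in C. This is an illustration under strong simplifying assumptions, not the Cerberus implementation: allocations are given concrete made-up addresses, provenance is just an allocation ID, and the empty and wildcard provenances are omitted. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { int id; uintptr_t base; size_t size; } allocation;
typedef struct { int prov; uintptr_t addr; } pointer_value;

/* Pointer plus pure integer: the address moves, the provenance is kept. */
pointer_value ptr_add(pointer_value p, size_t n, size_t elem_size)
{
    pointer_value r = { p.prov, p.addr + n * elem_size };
    return r;
}

/* What memcmp sees: only the representation (address), not the provenance. */
bool same_representation(pointer_value p, pointer_value q)
{
    return memcmp(&p.addr, &q.addr, sizeof p.addr) == 0;
}

static const allocation *find_allocation(const allocation *table, size_t n,
                                         int prov)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].id == prov)
            return &table[i];
    return NULL;
}

/* Is [addr, addr + width) inside the footprint of the allocation named by
 * the provenance of p? A false result models flagging undefined behaviour. */
bool access_ok(const allocation *table, size_t n, pointer_value p,
               size_t width)
{
    assert(table != NULL && width > 0);
    const allocation *a = find_allocation(table, n, p.prov);
    if (a == NULL || p.addr < a->base || width > a->size)
        return false;
    return p.addr - a->base <= a->size - width;
}
```

With x and y modelled as adjacent 4-byte allocations, the pointer built as &x+1 has the same representation as &y but fails access_ok, because its provenance still names the allocation of x; the pointer &y passes.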

For the pointer copying of §2.3, copying pointer values 
by copying their representation bytes (directly or indirectly) 
will work because those representation bytes (qua integer 
values) will carry the provenance of the original pointer 
value, and pure integer arithmetic will preserve that, as will 
the reconstruction of a pointer value that is read from mul- 
tiple byte writes. Copying pointer values via control-flow 
choices will not because a control-flow choice on a pointer 
value, e.g. a C if, will not carry the pointer’s provenance to 
values constructed in the “true” branch. 
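
The bytewise case corresponds to C code of the following shape, a user-written memcpy applied to a pointer object. This is a minimal sketch with hypothetical function names; under the model described above the copied bytes carry the provenance of p, so the final store through q is permitted.

```c
#include <assert.h>
#include <stddef.h>

/* A user-level memcpy that copies one representation byte at a time. */
void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
}

int write_through_copied_pointer(void)
{
    int x = 5;
    int *p = &x;
    int *q = NULL;
    copy_bytes(&q, &p, sizeof p);   /* q is rebuilt from the bytes of p */
    assert(q == p);
    *q = 11;                        /* access via the reconstructed pointer */
    return x;
}
```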

6. Validation 

Cerberus is intended principally as a semantic definition that 
captures all the looseness of the C standard. We make it 
executable to explore the semantics of difficult test cases 
and enable comparison w.r.t. implementations, but that very 
looseness makes execution combinatorially challenging. It
is not intended as a tool for production-sized C programs,
but it does support small “real C” test cases. We validate 
this on the small tests from Ellison and Rosu [16]. Of their 
561 Csmith tests [55], Cerberus currently gives the same result
as GCC for 556; the other 5 time out after 5 min. The
GCC Torture tests they mention are mostly not syntactically 
correct C11 programs, which we do not currently support.
We also tried 400 larger Csmith tests, 40-600 lines long. 
Of these, 22 did not terminate in GCC, presumably because of a bug
in Csmith (the second Csmith bug we found). Cerberus terminates
and agrees with GCC on 316, times out (after 30s)
on 56 more, and fails on 6. Our de facto tests are much 
more demanding, and for these our candidate model, which 
is still work in progress, currently has the intended behaviour 
only for 9. Many of the failures are quite exotic, e.g. involv- 
ing implementation-defined typing involving over-large con- 
stants, and questions about whether the result of sizeof is 
representable in types other than size_t, but there is still
considerable work to do. 

To demonstrate that Cerberus can be used as a basis for 
mechanised proof, we implemented a prototype translation 
validator, tvc, for the front-end of Clang. It supports only
extremely simple single-function C programs that perform no
I/O, take no arguments, and meet several additional restric- 
tions. Given such a program, tvc produces a mechanised 
Coq proof that the behaviours of the IR produced by the 
compiler are a subset of those allowed by Cerberus (strictly, 
by the version current when tvc was produced). The for- 
mer is defined by the Vellvm non-deterministic semantics of 
LLVM IR [56]; the latter the Cerberus elaboration and Lem’s 
translation into Coq of the Core semantics. Although tvc is 
extremely limited, it shows that Cerberus can be used as the 
basis for mechanised proof about individual C programs, and 
it is notable in being a translation validator for the front end 
of a production compiler. 

7. Related Work 

Several groups have worked to formalise aspects of the ISO 
standards, including Gurevich and Huggins [17], Cook and
Subramanian [13], Norrish [38, 39], Papaspyrou [40], Batty 
et al. [3], Ellison et al. [16, 18], and Krebbers et al. [25-29].
Krebbers [27, Ch.10] gives a useful survey. The main
difference with our work is that we are principally concerned 
with the more complex world of the de facto standards, 
especially for the memory object model, not just the ISO 
standard; to the best of our knowledge no previous work has 
systematically investigated this. 

Another important line of related work builds memory 
object models for particular C-like languages, including 
those for CompCert by Leroy et al. [30, 31], the proposals 
by Besson et al. [6, 7], the model used for seL4 verification 
by Tuch et al. [45], the model used for VCC by Cohen et 
al. [12], and the proposal by Kang et al. [23]. These are not 
trying to capture either de facto or ISO standards in general,
and in most respects are aimed at rather better-behaved code
than we see. CompCert started with a rather abstract block- 
ID/offset model [31]; the later proposals are relaxed to sup- 
port more de facto idioms. Ellison and Rosu [16] also used 
a simple block model, while Hathhorn et al. [18] add some 
effective type semantics (though see §3), and Krebbers has 
adopted a very strong interpretation of effective types. We 
refer the reader to our design-space document [10] for more 
detailed comparison with related work, especially with re- 
spect to memory layout model issues. 

None of the above addresses realistic general-purpose 
concurrency in C except that of Batty et al., who only 
study concurrency. The closest are the TSO model of Com- 
pCertTSO [46] and the KCC extension with TSO [19]. In 
contrast to C11 concurrency, supporting TSO can be done
with little impact on the threadwise semantics. 

Comparing with the KCC and Krebbers et al. work in 
more detail, there are also important differences in seman- 
tic style. Cerberus is expressed as a pure functional speci- 
fication in Lem (extractable to OCaml for execution and to 
prover definitions for reasoning), and our factorisation via 
Core helps manage the complexity. Ellison et al. give a more 
monolithic semantics in the K rewriting framework, aiming 
primarily at executability. It is defined directly on C abstract 
syntax. Overall, the state of the dynamic semantics contains 
over 60 environments - viable for execution, but problem- 
atic for understanding or for proof. This semantics has been 
relatively well validated by testing against GCC behaviour. 
Krebbers et al. are focussed instead on metatheory, with type 
system, operational semantics, and program logics defined 
in Coq and mechanised theorems relating them. Similar to 
Cerberus, their dynamic semantics is factored via a core cal- 
culus, but one rather closer to C than our Core. They also 
describe an interpreter, extracted from Coq, used on “a small 
test suite for both defined and undefined behavior” [29] but 
apparently so far without more extensive validation. 

In terms of coverage of C syntactic features, Cerberus lies 
between those two: it does not currently support bitfields, 
floats, longjmp, user-defined variadic functions, or multiple 
translation units, which Ellison et al. cover in some form [16, 
18, Fig. 1] (though they are not capturing all the semantic 
looseness of C, e.g. treating float operations by executing 
them in a particular implementation). 

Then there is a very extensive literature on static and 
dynamic analysis for C, and systems-oriented work on 
bug-finding tools (including tools such as Valgrind [36], 
Stack [48], and the Clang sanitisers). Exactly where each of 
these (and the models above) lie in our articulation of the de- 
sign space is an interesting question for detailed future work, 
which the current paper enables. 

8. Conclusion 

The semantics of C has been a vexed question for much of 
the last 40 years, and the interactions between compiler optimisations,
the corpus of systems code, analysis tools, and
safety and security properties make it increasingly critical. 
Our multifaceted investigation into aspects of the de facto 
and ISO standards should help clarify the situation; it en- 
ables a range of future work on testing, semantics, analysis, 
verification, and standardisation. 